Large-scale pattern-based information extraction from the world wide web

Author

Sebastian Blohm

Abstract

Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. Information Extraction systems require a model that describes how to identify relevant target information in texts. These models need to be adapted to the exact nature of the target information and to the nature of the textual input, which is typically accomplished by means of Machine Learning techniques that generate such models based on examples. Textual patterns are one particular type of Information Extraction model. They are underspecified explicit descriptions of text fragments. The automatic induction of such patterns from example text fragments that are known to contain target information is a common way to learn this type of extraction model. This thesis explores the potential of using textual patterns for Information Extraction from the World Wide Web. We review and discuss a large body of related work by describing it within a common framework. Then, we empirically analyze the effects of a multitude of design choices in pattern-based Information Extraction systems. In particular, we investigate how patterns can be filtered appropriately. We show how corpora of different nature can be exploited beneficially and how the nature of the patterns influences extraction quality. Finally, we present new ways of mining textual patterns by modelling pattern induction as a well-understood type of Data Mining problem.
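
To make the notion of inducing a textual pattern from an example fragment more concrete, the following is a minimal sketch in Python. It is not the method developed in the thesis: the function names `induce_pattern` and `extract`, the capital-city example, and the rule that an argument slot matches a run of capitalized words are all assumptions made purely for illustration.

```python
import re

# Assumption: an argument slot matches one or more capitalized words.
SLOT = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def induce_pattern(sentence, arg1, arg2):
    """Induce a pattern from one example sentence (hypothetical helper).

    The two known arguments are replaced by wildcard slots, and the text
    between them is kept literally. This is what makes the pattern an
    underspecified description of the original fragment.
    """
    i, j = sentence.find(arg1), sentence.find(arg2)
    if i < 0 or j < 0 or i + len(arg1) > j:
        return None  # both arguments must occur, with arg1 before arg2
    middle = sentence[i + len(arg1):j]
    return re.compile(SLOT + re.escape(middle) + SLOT)


def extract(pattern, text):
    """Apply an induced pattern to unseen text and return argument pairs."""
    return pattern.findall(text)


# One seed example known to contain the target information.
pattern = induce_pattern("Paris is the capital of France.", "Paris", "France")

corpus = ("Everyone knows that Berlin is the capital of Germany. "
          "Lima is the capital of Peru.")
pairs = extract(pattern, corpus)
print(pairs)
```

A pattern induced this way from a single example is very specific. A realistic system would generalize over many examples and filter unreliable patterns, which is the kind of design choice the abstract refers to.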

Similar resources

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

Towards Large Scale Semantic Annotation Built on MapReduce Architecture

Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if computer-understandable metadata reaches critical mass. Semantic metadata can be created manually, using automated annotation or tagging too...

Building Large Scale Relation KB from Text

Recently, more and more structured data in the form of RDF triples has been published and integrated into Linked Open Data (LOD). While the current LOD contains hundreds of data sources with billions of triples, it has a small number of distinct relations compared with the large number of entities. On the other hand, Web pages are growing rapidly, which results in a much larger number of textual cont...

Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs

A new approach to large-scale information extraction exploits both Web documents and query logs to acquire thousands of open-domain classes of instances, along with relevant sets of open-domain class attributes, at precision levels previously obtained only on small-scale, manually-assembled classes.
